Automatic Term and Collocation Extraction from English-Croatian corpus

Author

  • Sanja Seljan
Abstract

Term and collocation bases are valuable additional resources that cover domain-specific and frequently used expressions and can then be used in further research. The paper presents a possible model for building a terminology and collocation base, using statistical and linguistic approaches in order to gain experience in building such resources for the English-Croatian language pair. The aim of the paper is not to evaluate tools, but to give insight into their use and to gain experience in building, training and testing language resources. The paper compares two types of statistically based term and collocation bases, created from legislative documentation and then filtered through language-dependent linguistic patterns.
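
The two-step idea in the abstract (statistical candidate extraction followed by filtering through linguistic patterns) can be illustrated with a minimal sketch. The abstract does not name the association measure or the patterns used, so everything here is an assumption for illustration: pointwise mutual information (PMI) as the statistic, the hypothetical function name `extract_candidates`, the `min_freq` threshold, and the two part-of-speech patterns (adjective + noun, noun + noun). The input is assumed to be already tokenized and POS-tagged.

```python
import math
from collections import Counter

# Hypothetical POS patterns for two-word candidates (an assumption,
# not the patterns used in the paper).
PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def extract_candidates(tagged_tokens, min_freq=2, patterns=PATTERNS):
    """Score adjacent word pairs by PMI, keeping only pairs that pass
    a frequency threshold and match one of the POS patterns."""
    words = [w.lower() for w, _ in tagged_tokens]
    tags = [t for _, t in tagged_tokens]

    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))

    # Record the POS tags of the first occurrence of each bigram.
    bigram_tags = {}
    for i in range(len(words) - 1):
        bigram_tags.setdefault((words[i], words[i + 1]), (tags[i], tags[i + 1]))

    n_uni = len(words)
    n_bi = max(len(words) - 1, 1)

    results = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # statistical step: drop rare pairs
        if bigram_tags[(w1, w2)] not in patterns:
            continue  # linguistic step: drop pairs outside the patterns
        p_pair = freq / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        results.append(((w1, w2), freq, pmi))

    # Highest-scoring candidates first.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Called on a tagged token list such as `[("the", "DET"), ("member", "NOUN"), ("state", "NOUN"), ...]`, the function returns `(pair, frequency, PMI)` tuples. Frequent pairs such as determiner + noun are removed by the pattern filter even when they pass the frequency threshold, which is the point of combining the two steps.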

Similar resources

Evaluation of Classification Algorithms and Features for Collocation Extraction in Croatian

Collocations can be defined as words that occur together significantly more often than would be expected by chance. Many natural language processing applications such as natural language generation, word sense disambiguation and machine translation can benefit from having access to information about collocated words. We approach collocation extraction as a classification problem where the ta...

Automatic Corpus-Based Extraction of Chinese Legal Terms

This paper reports on a study involving the automatic extraction of Chinese legal terms. We used a word segmented corpus of Chinese court judgments to extract salient legal expressions with standard collocation learning techniques. Our method takes the characteristics of Chinese legal terms into account. The extracted terms were evaluated by human markers and compared against a legal term gloss...

Automatic Term Extraction from Knowledge Bank of Economics

KB-N is a web-accessible searchable Knowledge Bank comprising A) a parallel corpus of quality assured and calibrated English and Norwegian text drawn from economic-administrative knowledge domains, and B) a domain-focused database representing that knowledge universe in terms of defined concepts and their respective bilingual terminological entries. A central mechanism in connecting A and B is ...

Extracting terms and terminological collocations from the ELAN Slovene-English parallel corpus

In many scientific, technological or political fields, terminology and the production of up-to-date reference works is lagging behind, which causes problems for translators and results in inconsistent translations. Experience gained in various projects involving parallel corpora shows that automatic extraction of terms and terminological collocations is an achievable goal, however methods and techn...

Computational Metalexicography in Practice - Corpus-based support for the . . .

In a cooperation between dictionary publishers and computational linguists, raw material for the revision of the German part of a bilingual German→English dictionary (Langenscheidts Handwörterbuch Englisch, Neubearbeitung 1991) was produced. In a case study, the entries for...

Publication date: 2009